audio codec


Adapting Neural Audio Codecs to EEG

Kastrati, Ard, Lanzendörfer, Luca, Rigoni, Riccardo, Matilla, John Staib, Wattenhofer, Roger

arXiv.org Artificial Intelligence

EEG and audio are inherently distinct modalities, differing in sampling rate, channel structure, and scale. Yet, we show that pretrained neural audio codecs can serve as effective starting points for EEG compression, provided that the data are preprocessed to fit the codec's input constraints. Using DAC, a state-of-the-art neural audio codec, as our base, we demonstrate that raw EEG can be mapped into the codec's stride-based framing, enabling direct reuse of the audio-pretrained encoder-decoder. Even without modification, this setup yields stable EEG reconstructions, and fine-tuning on EEG data further improves fidelity and generalization compared to training from scratch. We systematically explore compression-quality trade-offs by varying residual codebook depth, codebook (vocabulary) size, and input sampling rate. To capture spatial dependencies across electrodes, we propose DAC-MC, a multi-channel extension with attention-based cross-channel aggregation and channel-specific decoding, while retaining the audio-pretrained initialization. Evaluations on the TUH Abnormal and Epilepsy datasets show that the adapted codecs preserve clinically relevant information, as reflected in spectrogram-based reconstruction loss and downstream classification accuracy.
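The "stride-based framing" idea above amounts to padding each EEG trace so its length is an exact multiple of the codec encoder's hop size, so the pretrained encoder sees whole frames. A minimal sketch, assuming a hypothetical stride of 512 samples (DAC's actual stride depends on the model variant):

```python
import numpy as np

def frame_for_codec(eeg, stride=512):
    """Zero-pad a 1-D EEG trace so its length is a multiple of the
    codec's encoder stride, yielding whole codec frames.
    `stride=512` is an assumed hop size, not DAC's documented value."""
    pad = (-len(eeg)) % stride
    framed = np.pad(eeg, (0, pad))
    return framed, framed.size // stride  # padded signal, frame count

x = np.random.randn(10_000).astype(np.float32)  # mock single-channel EEG
padded, n_frames = frame_for_codec(x)
print(padded.size, n_frames)  # 10240 20
```

Each channel can be framed this way independently; the paper's DAC-MC extension then aggregates across channels with attention rather than treating them as separate mono streams.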


ADNAC: Audio Denoiser using Neural Audio Codec

Jimon, Daniel, Vaida, Mircea, Stan, Adriana

arXiv.org Artificial Intelligence

Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof-of-concept for adapting a state-of-the-art neural audio codec, the Descript Audio Codec (DAC), for music denoising. This work overcomes the limitations of traditional architectures like U-Nets by training the model on a large-scale, custom-synthesized dataset built from diverse sources. Training is guided by a multi-objective loss function that combines time-domain, spectral, and signal-level fidelity metrics. Ultimately, this paper aims to present a PoC for high-fidelity, generative audio restoration. Noise reduction is a fundamental part of audio signal processing, substantially improving signal quality and intelligibility across domains like speech processing [1-3], music production and restoration [1], and bioacoustics analysis [2].


U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

Yang, Xusheng, Zhou, Long, Wang, Wenfu, Hu, Kai, Feng, Shulin, Li, Chenxing, Yu, Meng, Yu, Dong, Zou, Yuexian

arXiv.org Artificial Intelligence

We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically leads to severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
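The frame-rate/RVQ-depth trade-off above follows from simple arithmetic: an RVQ token stream costs frame_rate x layers x log2(codebook_size) bits per second, so lowering the frame rate 10x while deepening the stack roughly preserves bitrate but cuts autoregressive steps. A sketch, assuming a codebook size of 1024 (the paper's actual vocabulary sizes are among the configurations it explores):

```python
import math

def codec_bitrate(frame_rate_hz, rvq_layers, codebook_size):
    """Bits per second of an RVQ token stream: each frame emits one
    index per layer, at log2(codebook_size) bits per index."""
    return frame_rate_hz * rvq_layers * math.log2(codebook_size)

# 3-layer RVQ at 50 Hz vs. 32-layer RVQ at 5 Hz, codebook size 1024 assumed
hi_rate = codec_bitrate(50, 3, 1024)   # 1500.0 bps
lo_rate = codec_bitrate(5, 32, 1024)   # 1600.0 bps
print(hi_rate, lo_rate)
```

At comparable bitrates, the 5Hz stream requires 10x fewer autoregressive positions per second of audio, which is where the reported inference speedup comes from.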


Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning

Zhao, Junchuan, Wang, Xintong, Wang, Ye

arXiv.org Artificial Intelligence

Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALL-E X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.


Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding

Chary, Luis Felipe, Ramirez, Miguel Arjona

arXiv.org Artificial Intelligence

We present a neural speech codec that challenges the need for complex residual vector quantization (RVQ) stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid. To ensure high-fidelity synthesis, we employ late-stage adversarial fine-tuning for the VQ-VAE and train a HiFi-GAN vocoder from scratch on the codec's reconstructed spectrograms. Operating at approximately 7.5 kbits/s for 16 kHz speech, our system was evaluated against several state-of-the-art neural codecs using objective metrics such as STOI, PESQ, MCD, and ViSQOL. The results demonstrate that our simplified, non-residual architecture achieves competitive perceptual quality and intelligibility, validating it as an effective and open foundation for future low-latency codec designs.
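The patchwise quantization described above can be sketched as: tile the (freq, time) mel-spectrogram into 4x4 blocks and snap each block to its nearest entry in one shared codebook. A minimal NumPy illustration with assumed dimensions (80 mel bins, 512 codes); the paper's trained codebook is learned end-to-end, not random:

```python
import numpy as np

def quantize_patches(mel, codebook, p=4):
    """Split a (freq, time) mel-spectrogram into non-overlapping
    p x p patches and assign each to its nearest codebook vector
    by Euclidean distance. Returns a 2-D grid of code indices.
    Both dimensions of `mel` must be divisible by p."""
    F, T = mel.shape
    patches = (mel.reshape(F // p, p, T // p, p)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, p * p))            # (n_patches, 16)
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1).reshape(F // p, T // p)    # discrete latent grid

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 16))        # 80 mel bins, 16 frames
codebook = rng.standard_normal((512, 16))  # 512 shared 4x4 codes
codes = quantize_patches(mel, codebook)
print(codes.shape)  # (20, 4)
```

Because every patch draws from the same codebook, the decoder's discrete input is a single index grid rather than a per-layer residual stack.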


L3AC: Towards a Lightweight and Lossless Audio Codec

Zhai, Linwei, Ding, Han, Zhao, Cui, Wang, Fei, Wang, Ge, Zhi, Wang, Xi, Wei

arXiv.org Artificial Intelligence

Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and provide discrete tokens for generative modeling. However, leading approaches often rely on resource-intensive models and complex multi-quantizer architectures, limiting their practicality in real-world applications. In this work, we introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. To enhance reconstruction fidelity while minimizing model complexity, L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv--a novel structure designed to capture acoustic variations across multiple temporal scales. Despite its compact design, extensive experiments across diverse datasets demonstrate that L3AC matches or exceeds the reconstruction quality of leading codecs while reducing computational overhead by an order of magnitude. The single-quantizer design further enhances its adaptability for downstream tasks. The source code is publicly available at https://github.com/zhai-lw/L3AC.


NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

Casanova, Edresson, Neekhara, Paarth, Langman, Ryan, Hussain, Shehzeen, Ghosh, Subhankar, Yang, Xuesong, Jukić, Ante, Li, Jason, Ginsburg, Boris

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.


SpectroStream: A Versatile Neural Codec for General Audio

Li, Yunpeng, Han, Kehang, McWilliams, Brian, Borsos, Zalan, Tagliasacchi, Marco

arXiv.org Artificial Intelligence

We propose SpectroStream, a full-band multi-channel neural audio codec. Successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4--16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality especially at higher sample rate. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency.


Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Han, Yichen, Hao, Xiaoyang, Chen, Keming, Xiong, Weibo, He, Jun, Zhang, Ruonan, Cao, Junjie, Liu, Yue, Li, Bowen, Zhang, Dongrui, Xia, Hui, Fu, Huilei, Jia, Kai, Guo, Kaixuan, Jin, Mingli, Meng, Qingyun, Ma, Ruidong, Fang, Ruiqian, Guo, Shaotong, Li, Xuhui, Xiang, Yang, Zhang, Ying, Liu, Yulong, Li, Yunfeng, Zhang, Yuyi, Zhou, Yuze, Wang, Zhen, Chen, Zhaowen

arXiv.org Artificial Intelligence

Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model inter-codebook dependencies for higher-quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference speed. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baselines. This suggests that scaling up compression via multi-codebook modeling is a promising direction for high-fidelity, general-purpose speech and audio generation.
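The "parallelized prediction with a fixed delay" strategy is in the family of delay patterns used for multi-codebook token grids: layer k is shifted right by k steps so all layers of a position can be predicted in one forward pass, while deeper layers still condition on shallower layers from earlier steps. A sketch of the generic pattern (the paper's exact Delay Multihead scheme may differ in details):

```python
def apply_delay_pattern(codes, pad=-1):
    """Shift layer k of a multi-layer token grid right by k steps.
    `codes` is a list of L layers, each a list of T token ids;
    returns an L x (T + L - 1) grid padded with `pad`."""
    L, T = len(codes), len(codes[0])
    out = [[pad] * (T + L - 1) for _ in range(L)]
    for k, layer in enumerate(codes):
        for t, tok in enumerate(layer):
            out[k][t + k] = tok
    return out

grid = [[1, 2, 3],   # RVQ layer 0, 3 frames
        [4, 5, 6]]   # RVQ layer 1
print(apply_delay_pattern(grid))  # [[1, 2, 3, -1], [-1, 4, 5, 6]]
```

The sequence grows by only L-1 steps, so inference stays close to single-layer autoregressive cost while preserving inter-layer ordering.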


MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

Song, Yakun, Chen, Jiawei, Zhuang, Xiaobin, Du, Chenpeng, Ma, Ziyang, Wu, Jian, Cong, Jian, Jia, Dongya, Chen, Zhuo, Wang, Yuping, Wang, Yuxuan, Chen, Xie

arXiv.org Artificial Intelligence

Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
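The noise-injection idea above can be sketched in a few lines: perturb the pre-quantization latent with Gaussian noise during training only, forcing the codes to be robust to the perturbation (which, as the paper argues analytically, attenuates high-frequency components). The noise scale and latent shape here are illustrative assumptions, not MagiCodec's configuration:

```python
import numpy as np

def noisy_latent(z, sigma=0.1, training=True, seed=None):
    """Gaussian noise injection on a latent: applied only during
    training so the downstream quantizer must produce codes that
    are stable under the perturbation. `sigma` is a hypothetical
    noise scale."""
    if not training:
        return z
    rng = np.random.default_rng(seed)
    return z + sigma * rng.standard_normal(z.shape)

z = np.zeros((4, 8))                 # mock latent: 4 frames x 8 dims
z_train = noisy_latent(z, seed=0)    # perturbed during training
z_eval = noisy_latent(z, training=False)  # identity at inference
print(np.allclose(z_eval, z))  # True
```

At inference the injection is disabled, so reconstruction fidelity is unaffected; the regularization acts purely on what the tokens learn to encode.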